Following on from a previous post, this article discusses:

  • Assessing how well the GLM generalises to new data with cross validation
  • Automated GLM selection using grid search

Previously, we looked at using H2O's GLM function to set a rating plan for life insurance contracts. Importing and preparing the data, the use of offsets and fitting the GLM with h2o were all covered in that post.

The previous model was built using a simulated dataset containing 1.5m records. Each record represents an individual and the columns contain their age, gender, occupation, location, salary details, exposure (how long, in years, the individual was insured) and whether or not a claim was made under their policy.

The dataset was split into training data (60%), test data (20%) and holdout data (20%). However, the model we extracted actually only used the training data. Next, I'll use the test and holdout sets, as well as cross validation, to refine the model and check that it actually generalises well to new data.
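As a rough illustration of that split, something like the following could be used. This is a minimal sketch only - the actual import, preparation and aggregation steps were covered in the previous post, and 'policies' is a placeholder name for the prepared H2OFrame:

# Hypothetical sketch of the 60/20/20 split; 'policies' is a placeholder
# for the prepared H2OFrame from the previous post
splits  = h2o.splitFrame(policies, ratios = c(0.6, 0.2), seed = 123)
train   = splits[[1]]  # 60% training data
test    = splits[[2]]  # 20% test data
holdout = splits[[3]]  # remaining 20% holdout data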

First, check the performance of the baseline model against the holdout dataset - this will let us see whether any changes we make actually improve the model.

In [4]:
# Check performance of baseline model against holdout
h2o.performance(mod1, newdata = holdout_aggregated)
H2ORegressionMetrics: glm

MSE:  0.1958794
RMSE:  0.4425826
MAE:  0.2249019
RMSLE:  0.2507581
Mean Residual Deviance :  0.458425
R^2 :  0.3866104
Null Deviance :1726.763
Null D.o.F. :3758
Residual Deviance :1723.22
Residual D.o.F. :3748
AIC :2897.024

Adding Regularisation

Now, we can experiment with adding regularisation to see if the model can be improved. A detailed discussion of regularisation is beyond the scope of this post. Effectively though, it serves to simplify the GLM and reduces the chance of overfitting to the training set.

Regularisation involves adding a penalty to the loss function used during coefficient estimation. There are two types of penalty that can be applied, $l_1$ and $l_2$. In both cases, adding a penalty shrinks the model coefficients. However, the shrinkage is applied differently depending on which penalty is used.

The $l_1$ (LASSO) penalty penalises the sum of the absolute values of the coefficients. Coefficients shrink in absolute terms, so with a sufficiently large penalty some coefficients are reduced exactly to 0 (a coefficient of 0 removes that feature from the model). Essentially, this stops the model fitting to noise in the data by removing certain features.

The $l_2$ (ridge) penalty shrinks coefficients proportionally and does not remove any coefficients from the model.

According to the h2o documentation, a combination of the $l_1$ and $l_2$ penalties (elastic net) is beneficial because $l_1$ induces sparsity, while $l_2$ gives stability and encourages the grouping effect (where a group of correlated variables tend to be dropped or added into the model simultaneously).

H2O allows us to fit an elastic net GLM by passing additional parameters into the model. The $\lambda$ parameter controls the strength of the penalty we wish to apply and the $\alpha$ parameter controls the distribution between the $l_1$ and $l_2$ penalties.
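Schematically (following the form given in the h2o documentation), the coefficients are chosen to minimise a penalised objective along the lines of

$$\min_{\beta_0,\,\beta}\; -\tfrac{1}{N}\,\log L(\beta_0, \beta) \;+\; \lambda\left(\alpha\lVert\beta\rVert_1 + \tfrac{1-\alpha}{2}\lVert\beta\rVert_2^2\right)$$

where $\log L$ is the model log-likelihood (Poisson in our case), $\lambda \ge 0$ sets the overall strength of the penalty and $\alpha \in [0, 1]$ splits it between the $l_1$ and $l_2$ terms.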

  • Alpha - a value of 1.0 for alpha represents LASSO, and an alpha value of 0.0 produces ridge regression.
  • Lambda - The lambda parameter controls the amount of regularization applied. If lambda is 0.0, no regularization is applied, and the alpha parameter is ignored.
  • If no value is specified for alpha and/or lambda, h2o will use default settings. Default alpha = 0.5, default lambda is calculated by h2o.
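As a quick illustration of setting these parameters directly (separate from the grid search used below), a single elastic net GLM could be fitted as follows - the alpha and lambda values here are purely illustrative, not tuned:

# Hypothetical sketch: fit one elastic net GLM with explicit penalty settings.
# alpha = 0.5 weights the l1 and l2 penalties equally; lambda = 0.001 applies
# a small amount of shrinkage. Both values are illustrative.
mod_enet = h2o.glm(x = predictors,
                   y = response,
                   training_frame = train_aggregated,
                   validation_frame = test_aggregated,
                   family = 'poisson',
                   offset_column = offset,
                   alpha = 0.5,
                   lambda = 0.001,
                   seed = 123)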

We don't know beforehand which values of $\alpha$ and $\lambda$ will be optimal. Rather than manually trying lots of different values, it is much easier to grid search over $\alpha$ in h2o and set lambda_search = TRUE (this automatically tries to select the optimal value of $\lambda$ for each model). Grid search lets us try lots of different values for $\alpha$ and assess which model performs best against a validation frame.

The following code will build a 'grid' of models. First, we need to define which values of $\alpha$ to try. This is done with the seq() function, which is used below to generate a sequence of numbers from 0 to 0.95 in increments of 0.01. Trying all of these values for $\alpha$ will result in 96 models being built. These values are stored in a list called 'hypers'. Next, instead of h2o.glm(), use h2o.grid(). This requires that we specify the algorithm as 'glm'. Also, pass in the 'hypers' list via the hyper_params argument. Note that lambda_search is set to TRUE.

In [6]:
# Add regularisation - does this improve the model?

hypers = list(alpha = seq(0, 0.95, 0.01))

grid = h2o.grid(algorithm = 'glm',
                grid_id = 'GLMgrid',
                x = predictors,
                y = response,
                training_frame = train_aggregated,
                validation_frame = test_aggregated,
                family = 'poisson',
                offset_column = offset,
                lambda_search = TRUE,
                hyper_params = hypers,
                seed = 123)

# sort the grid in terms of mse (low --> high) and pick out the best performing model
sortedGrid = h2o.getGrid('GLMgrid', sort_by = 'mse', decreasing = F)

# Extract the 'best' model
# (i.e the model stored at the top of the grid)
mod2 = h2o.getModel(sortedGrid@model_ids[[1]])
In [7]:
# The sorted grid looks like:
sortedGrid
H2O Grid Details
================

Grid ID: GLMgrid 
Used hyper parameters: 
  -  alpha 
Number of models: 96 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by increasing mse
   alpha       model_ids                 mse
1 [0.08] GLMgrid_model_8 0.19413408070930224
2 [0.06] GLMgrid_model_6  0.1941498061682495
3 [0.07] GLMgrid_model_7  0.1941535475088932
4 [0.09] GLMgrid_model_9 0.19416204194644499
5 [0.05] GLMgrid_model_5 0.19416543267945913

---
    alpha        model_ids                 mse
91 [0.01]  GLMgrid_model_1 0.19422795701805917
92 [0.11] GLMgrid_model_11  0.1942491959396608
93 [0.12] GLMgrid_model_12 0.19425718714512488
94 [0.03]  GLMgrid_model_3 0.19425860139279355
95 [0.04]  GLMgrid_model_4 0.19425889315723813
96 [0.13] GLMgrid_model_13 0.19426459133010185
In [8]:
# Extracting the ratings
mod2_ratings = mod2@model$coefficients_table
mod2_ratings$rating = exp(mod2_ratings$coefficients)

# View the table of coefficients
# (some rounding has been applied using dplyr mutate function)
mod2_ratings %>% 
    mutate_if(is.numeric, funs(round(., 3)))
names coefficients standardized_coefficients rating
Intercept 0.219 0.219 1.245
Salary_Band.1 0.000 0.000 1.000
Salary_Band.2 0.000 0.000 1.000
Salary_Band.3 0.000 0.000 1.000
Salary_Band.4 0.000 0.000 1.000
Salary_Band.5 0.000 0.000 1.000
Location.1 0.000 0.000 1.000
Location.2 0.000 0.000 1.000
Location.3 0.011 0.011 1.011
Location.4 0.000 0.000 1.000
Occupation.1 0.063 0.063 1.065
Occupation.2 0.000 0.000 1.000
Occupation.3 0.000 0.000 1.000
Male.0 0.008 0.008 1.008
Male.1 -0.008 -0.008 0.992
In [9]:
# Check performance of both models against the holdout data
perf1 = h2o.performance(mod1, holdout_aggregated)
perf2 = h2o.performance(mod2, holdout_aggregated)

# Check MSE of original model
round(h2o.mse(perf1),5)
0.19588
In [10]:
# And now, MSE of regularised model, automatically 'tuned'
round(h2o.mse(perf2),5)
0.19632

Since the MSE of the new model is larger than that of the original model, the 'tuning' may actually have resulted in overfitting to the test set used during the grid search. However, the holdout dataset is only one sample, so it's difficult to draw firm conclusions from this result. This is where cross validation comes in. We can use cross validation to see which model generalises better to new data by examining performance across multiple test sets.

Cross Validation

So far, the model has only been validated using one test set. K-fold cross validation can be used to check that the $\alpha$ parameter selected is generally sound and that the model will generalise well to a number of unseen data sets. The basic idea is as follows:

  • Partition the data into k subsets
  • Use k-1 subsets as the training data
  • Test the model on the remaining subset
  • Repeat until all subsets have been used as a test set once

This will give us mean performance measures across the k subsets, as well as a standard deviation. The 'best' model (from a number of competing models) might be considered to be the one with the best mean cross validation performance. The next steps describe how to cross validate an existing model in h2o. The alternative is to set the nfolds argument to a value greater than 1 when fitting the models in the first place. However, since this builds nfolds + 1 models for each candidate, it can be computationally expensive when combined with grid search (depending on the size of the dataset and which algorithm you are fitting - GLM is normally quite fast).
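For reference, requesting cross validation at fit time just means setting nfolds on the original call - a minimal sketch with illustrative settings:

# Hypothetical sketch: build the GLM with 5-fold cross validation from the
# start, rather than cross validating an existing model afterwards
mod_cv_direct = h2o.glm(x = predictors,
                        y = response,
                        training_frame = train_aggregated,
                        family = 'poisson',
                        offset_column = offset,
                        lambda_search = TRUE,
                        nfolds = 5,
                        seed = 123)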

We can select the best performing models in the grid and cross validate each of these to check which has the best mean performance. We can do this using the train and test sets combined. I will use 5 fold cross validation in the following code blocks.

Note - the cross validation performance metrics will not be directly comparable with those computed on the earlier holdout data, since the testing examples in each k fold will be different to those in the holdout set.

In [11]:
# Combine train & test data and cross validate the models at the top
# of the grid.
# Again, use lambda search to choose optimal lambda.

# The following code is based on a tutorial taken from the H2O website
df = h2o.rbind(train_aggregated,test_aggregated)

for (i in 1:5){
  glm <- h2o.getModel(sortedGrid@model_ids[[i]])
  cvglm <- do.call(h2o.glm,
                   {
                     p <- glm@parameters
                     p$model_id = NULL         
                     p$training_frame = df  # no validation frame
                     p$validation_frame = NULL  
                     p$offset_column = 'Offset'
                     p$nfolds = 5          # add the cross validation   
                     p$lambda_search = T
                     p
                   }
  )
  print(glm@model_id)
  print(cvglm@model$cross_validation_metrics_summary[3,]) # print out mse row
}
[1] "GLMgrid_model_8"
Cross-Validation Metrics Summary: 
          mean         sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
mse 0.39514202 0.02417593 0.41696805  0.4179402 0.39798224 0.41450343
    cv_5_valid
mse 0.32831618
[1] "GLMgrid_model_6"
Cross-Validation Metrics Summary: 
         mean          sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
mse 0.3952421 0.024238229 0.41700536 0.41825843  0.3980208  0.4146654
    cv_5_valid
mse 0.32826048
[1] "GLMgrid_model_7"
Cross-Validation Metrics Summary: 
         mean          sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
mse 0.3951722 0.024202611  0.4169403   0.418074  0.3980001  0.4145702
    cv_5_valid
mse 0.32827646
[1] "GLMgrid_model_9"
Cross-Validation Metrics Summary: 
          mean          sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
mse 0.39519712 0.024199862 0.41699398 0.41802925  0.3979665  0.4146775
    cv_5_valid
mse 0.32831836
[1] "GLMgrid_model_5"
Cross-Validation Metrics Summary: 
          mean         sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
mse 0.39520955 0.02424613 0.41696036  0.4182716 0.39804438  0.4145749
    cv_5_valid
mse 0.32819656
In [12]:
# Compare these to cross validated version of baseline model:
mod1_cv <- do.call(h2o.glm,
                         
                         {
                           p <- mod1@parameters
                           p$model_id = NULL          
                           p$training_frame = df      
                           p$validation_frame = NULL  
                           p$nfolds = 5              
                           p$offset_column = 'Offset'
                           p
                         }
)
mod1_cv@model$cross_validation_metrics_summary[3,]
          mean          sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
mse 0.40253404 0.026616871  0.4350453 0.41589925 0.42091507  0.4119179 0.32889268

Conclusion

It looks like the tuned model performs better if we look at cross validation performance rather than performance against a single holdout set. So, we might determine that the regularised model generalises best to unseen data. The rating plan we extract should be based on this second model.

Regularisation has led to a much simpler model. In fact, the only real loadings picked up are for location 3 and occupation class 1 (the effect of gender is negligible).

